Install and/or load packages for this session

  • install.packages(): “Download and install packages from CRAN-like repositories or from local files.”
  • library(): loads your package you installed
  • Best Practice Note. install.packages() downloads files from the internet, so it’s “polite” to “comment out” (# = comment) this code before you share it with people so they don’t download files they don’t really want by accident.
# install.packages("tidyverse")
# install.packages("psych")
# install.packages("datasauRus")
# install.packages("lme4")
# install.packages("fivethirtyeight")

library(tidyverse)
library(psych)
library(datasauRus)
library(lme4)
library(fivethirtyeight)

Anscombe’s quartet

Same descriptives …

describe(anscombe)
vars n mean sd median trimmed mad min max range skew kurtosis se
x1 1 11 9.000000 3.316625 9.00 9.000000 4.447800 4.00 14.00 10.00 0.0000000 -1.5289256 1.0000000
x2 2 11 9.000000 3.316625 9.00 9.000000 4.447800 4.00 14.00 10.00 0.0000000 -1.5289256 1.0000000
x3 3 11 9.000000 3.316625 9.00 9.000000 4.447800 4.00 14.00 10.00 0.0000000 -1.5289256 1.0000000
x4 4 11 9.000000 3.316625 8.00 8.000000 0.000000 8.00 19.00 11.00 2.4669110 4.5206612 1.0000000
y1 5 11 7.500909 2.031568 7.58 7.490000 1.823598 4.26 10.84 6.58 -0.0483735 -1.1991228 0.6125408
y2 6 11 7.500909 2.031657 8.14 7.794444 1.467774 3.10 9.26 6.16 -0.9786929 -0.5143191 0.6125676
y3 7 11 7.500000 2.030424 7.11 7.152222 1.527078 5.39 12.74 7.35 1.3801204 1.2400439 0.6121958
y4 8 11 7.500909 2.030578 7.04 7.195556 1.897728 5.25 12.50 7.25 1.1207739 0.6287512 0.6122425

Same linear regression results …

summary(lm(y1 ~ x1, data = anscombe))
## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
summary(lm(y2 ~ x2, data = anscombe))
## 
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
summary(lm(y3 ~ x3, data = anscombe))
## 
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
summary(lm(y4 ~ x4, data = anscombe))
## 
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

But the plots …

tibble(set = rep(1:4, each = nrow(anscombe)) %>% factor(),
       x = with(anscombe, c(x1, x2, x3, x4)),
       y = with(anscombe, c(y1, y2, y3, y4))) %>% 
  ggplot(mapping = aes(x = x, y = y, color = set)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(facets = ~ set) +
  theme(legend.position = "none")

Datasaurus Dozen

Same descriptives …

with(datasaurus_dozen, describeBy(x, group = dataset, mat = TRUE))
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
X11 1 away 1 142 54.26610 16.76983 53.34030 54.35902 21.22096 15.56075 91.63996 76.07921 -0.0238768 -1.2428294 1.407292
X12 2 bullseye 1 142 54.26873 16.76924 53.84209 53.86032 17.49410 19.28820 91.73554 72.44733 0.1682837 -0.2682028 1.407243
X13 3 circle 1 142 54.26732 16.76001 54.02321 54.07580 16.29359 21.86358 85.66476 63.80118 0.0304864 -0.6690924 1.406469
X14 4 dino 1 142 54.26327 16.76514 53.33330 53.68646 15.96656 22.30770 98.20510 75.89740 0.2807568 -0.2854912 1.406899
X15 5 dots 1 142 54.26030 16.76774 50.97677 54.87898 2.76161 25.44353 77.95444 52.51091 -0.1226525 -0.8711817 1.407117
X16 6 h_lines 1 142 54.26144 16.76590 53.06968 53.87282 17.90469 22.00371 98.28812 76.28441 0.2408185 -0.3549513 1.406963
X17 7 high_lines 1 142 54.26881 16.76670 54.16869 53.91367 16.00184 17.89350 96.08052 78.18702 0.1891617 -0.2816209 1.407031
X18 8 slant_down 1 142 54.26785 16.76676 53.13516 53.83147 16.23906 18.10947 95.59342 77.48394 0.2521441 -0.2597905 1.407035
X19 9 slant_up 1 142 54.26588 16.76885 54.26135 53.68574 16.12922 20.20978 95.26053 75.05075 0.2299168 -0.3829457 1.407211
X110 10 star 1 142 54.26734 16.76896 56.53473 54.33442 22.16494 27.02460 86.43590 59.41129 0.0814822 -1.1882423 1.407220
X111 11 v_lines 1 142 54.26993 16.76996 50.36289 53.56542 28.37296 30.44965 89.50485 59.05520 0.2561127 -0.5817808 1.407304
X112 12 wide_lines 1 142 54.26692 16.77000 64.55023 54.70481 11.67615 27.43963 77.91587 50.47624 -0.3212056 -1.7192867 1.407307
X113 13 x_shape 1 142 54.26015 16.76996 47.13646 53.26637 14.75946 31.10687 85.44619 54.33932 0.4776030 -1.3729389 1.407304
with(datasaurus_dozen, describeBy(y, group = dataset, mat = TRUE))
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
X11 1 away 1 142 47.83472 26.93974 47.53527 48.23284 35.89123 0.0151193 97.47577 97.46065 -0.0932474 -1.2817225 2.260733
X12 2 bullseye 1 142 47.83082 26.93573 47.38294 47.82848 36.83154 9.6915471 85.87623 76.18468 0.0122276 -1.4540983 2.260396
X13 3 circle 1 142 47.83772 26.93004 51.02502 47.71135 41.71674 16.3265464 85.57813 69.25159 0.0048960 -1.8029697 2.259918
X14 4 dino 1 142 47.83225 26.93540 46.02560 46.90284 30.79242 2.9487000 99.48720 96.53850 0.2472603 -1.0635521 2.260369
X15 5 dots 1 142 47.83983 26.93019 51.29929 46.87745 46.98792 15.7718920 94.24933 78.47744 0.1476910 -1.4088664 2.259931
X16 6 h_lines 1 142 47.83025 26.93988 50.47353 47.23323 29.62371 10.4639152 90.45894 79.99502 0.1487355 -1.1603030 2.260744
X17 7 high_lines 1 142 47.83545 26.94000 32.49920 47.15084 22.76810 14.9139625 87.15221 72.23825 0.2024742 -1.8022348 2.260754
X18 8 slant_down 1 142 47.83590 26.93610 46.40131 47.30665 28.64590 0.3038724 99.64418 99.34031 0.1873369 -0.9492562 2.260428
X19 9 slant_up 1 142 47.83150 26.93861 45.29224 46.83068 33.79656 5.6457775 99.57959 93.93381 0.2554495 -1.0933159 2.260638
X110 10 star 1 142 47.83955 26.93027 50.11055 46.55526 37.29715 14.3655905 92.21499 77.84940 0.2372490 -1.3589160 2.259938
X111 11 v_lines 1 142 47.83699 26.93768 47.11362 46.82624 33.00178 2.7347602 99.69468 96.95992 0.2253746 -1.0479510 2.260560
X112 12 wide_lines 1 142 47.83160 26.93790 46.27933 46.96953 32.58742 0.2170063 99.28376 99.06676 0.2484350 -1.0901205 2.260578
X113 13 x_shape 1 142 47.83972 26.93000 39.87621 46.97941 33.04887 4.5776614 97.83761 93.25995 0.2277144 -1.4442001 2.259915

Same linear regression results …

summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "away", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "away", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.481 -24.474   0.422  24.196  49.537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.4251     7.6932   6.944 1.31e-10 ***
## x            -0.1030     0.1355  -0.760    0.448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.98 on 140 degrees of freedom
## Multiple R-squared:  0.004112,   Adjusted R-squared:  -0.003001 
## F-statistic: 0.5781 on 1 and 140 DF,  p-value: 0.4483
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "bullseye", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "bullseye", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.506 -23.548  -1.602  22.128  38.384 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.8095     7.6904   6.997 9.92e-11 ***
## x            -0.1102     0.1354  -0.813    0.417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 140 degrees of freedom
## Multiple R-squared:  0.004704,   Adjusted R-squared:  -0.002405 
## F-statistic: 0.6617 on 1 and 140 DF,  p-value: 0.4173
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "circle", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "circle", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.649 -28.873   3.759  29.968  37.582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.7970     7.6925   6.993 1.01e-10 ***
## x            -0.1098     0.1355  -0.811    0.419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 140 degrees of freedom
## Multiple R-squared:  0.004671,   Adjusted R-squared:  -0.002439 
## F-statistic: 0.657 on 1 and 140 DF,  p-value: 0.419
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "dino", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "dino", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.439 -23.213  -1.296  21.368  52.050 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.4530     7.6934   6.948 1.29e-10 ***
## x            -0.1036     0.1355  -0.764    0.446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.98 on 140 degrees of freedom
## Multiple R-squared:  0.004157,   Adjusted R-squared:  -0.002957 
## F-statistic: 0.5844 on 1 and 140 DF,  p-value: 0.4459
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "h_lines", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "h_lines", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.631 -19.015   1.345  22.421  44.139 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.21109    7.69546   6.915 1.53e-10 ***
## x           -0.09916    0.13554  -0.732    0.466    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.98 on 140 degrees of freedom
## Multiple R-squared:  0.003809,   Adjusted R-squared:  -0.003307 
## F-statistic: 0.5353 on 1 and 140 DF,  p-value: 0.4656
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "high_lines", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "high_lines", ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.37 -24.92 -13.62  28.96  41.13 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.8088     7.6927   6.995    1e-10 ***
## x            -0.1101     0.1355  -0.812    0.418    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 140 degrees of freedom
## Multiple R-squared:  0.004693,   Adjusted R-squared:  -0.002417 
## F-statistic: 0.6601 on 1 and 140 DF,  p-value: 0.4179
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "slant_down", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "slant_down", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.855 -20.283  -1.849  20.947  51.612 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.8497     7.6912   7.001 9.69e-11 ***
## x            -0.1108     0.1355  -0.818    0.415    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 140 degrees of freedom
## Multiple R-squared:  0.004758,   Adjusted R-squared:  -0.002351 
## F-statistic: 0.6693 on 1 and 140 DF,  p-value: 0.4147
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "slant_up", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "slant_up", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.039 -22.605  -2.074  23.669  53.404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.8126     7.6910   6.997 9.93e-11 ***
## x            -0.1102     0.1355  -0.814    0.417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 140 degrees of freedom
## Multiple R-squared:  0.004707,   Adjusted R-squared:  -0.002402 
## F-statistic: 0.6621 on 1 and 140 DF,  p-value: 0.4172
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "v_lines", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "v_lines", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.605 -24.261  -1.092  18.186  51.402 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.8908     7.6903   7.008 9.38e-11 ***
## x            -0.1116     0.1354  -0.824    0.412    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 140 degrees of freedom
## Multiple R-squared:  0.004823,   Adjusted R-squared:  -0.002286 
## F-statistic: 0.6784 on 1 and 140 DF,  p-value: 0.4115
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "wide_lines", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "wide_lines", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.220 -23.863  -0.718  20.934  52.676 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.6349     7.6915   6.973 1.12e-10 ***
## x            -0.1069     0.1355  -0.789    0.431    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 140 degrees of freedom
## Multiple R-squared:  0.004432,   Adjusted R-squared:  -0.002679 
## F-statistic: 0.6233 on 1 and 140 DF,  p-value: 0.4312
summary(lm(y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == "x_shape", ]))
## 
## Call:
## lm(formula = y ~ x, data = datasaurus_dozen[datasaurus_dozen$dataset == 
##     "x_shape", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.381 -24.631  -8.142  26.292  50.807 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.5542     7.6889   6.965 1.17e-10 ***
## x            -0.1053     0.1354  -0.778    0.438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 140 degrees of freedom
## Multiple R-squared:  0.004301,   Adjusted R-squared:  -0.002811 
## F-statistic: 0.6048 on 1 and 140 DF,  p-value: 0.4381

But the plots …

ggplot(data = datasaurus_dozen, aes(x = x, y = y, color = dataset)) +
  geom_point() +
  theme(legend.position = "none") +
  facet_wrap(~dataset, ncol = 3)

How do I answer research questions with plots?

  • Use ggplot2.
  • ggplot2 metaphor. Think of a ggplot2 plot like a “map of the world” that you build one layer at a time.
  • Imagine a map of the world without land (i.e., water only).
  • Now imagine adding “layers” (e.g., continents, colors, mountains, legends). ggplot2’s kinda like that.

How do I answer research questions with plots?

  • With ggplot2, you build plots step-by-step, layer-by-layer. this system is based on the grammar of graphics: statistical graphics map data onto perceivable aesthetic attributes (e.g., position, color, shape, size, line type) of geometric objects (e.g., points, bars, lines).

Source. grammar of graphics

How do I answer research questions with plots?

  1. grammar of graphics, a visualization philosophy
  2. Start with ggplot()
  3. Supply a dataset and aesthetic mapping, aes()
  4. Add on …
    • Layers, like geom_point() or geom_histogram()
    • Scales, like scale_colour_brewer()
    • Faceting Specifications, like facet_wrap()
    • Coordinate Systems, like coord_flip()

Source. ggplot2.tidyverse.org

Example: Do cars with big engines use more fuel than cars with small engines?

  • mpg dataset. “This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.”

Sources. R for Data Science and help("mpg")

mpg %>% sample_n(size = 15)
manufacturer model displ year cyl trans drv cty hwy fl class
toyota toyota tacoma 4wd 2.7 2008 4 manual(m5) 4 17 22 r pickup
toyota 4runner 4wd 4.7 2008 8 auto(l5) 4 14 17 r suv
toyota toyota tacoma 4wd 3.4 1999 6 manual(m5) 4 15 17 r pickup
dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 14 19 r pickup
subaru forester awd 2.5 2008 4 auto(l4) 4 18 23 p suv
nissan altima 2.4 1999 4 manual(m5) f 21 29 r compact
subaru impreza awd 2.5 2008 4 manual(m5) 4 20 27 r compact
ford explorer 4wd 4.6 2008 8 auto(l6) 4 13 19 r suv
dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup
nissan altima 3.5 2008 6 manual(m6) f 19 27 p midsize
jeep grand cherokee 4wd 6.1 2008 8 auto(l5) 4 11 14 p suv
volkswagen passat 1.8 1999 4 manual(m5) f 21 29 p midsize
nissan maxima 3.5 2008 6 auto(av) f 19 25 p midsize
hyundai tiburon 2.7 2008 6 auto(l4) f 17 24 r subcompact
subaru impreza awd 2.5 2008 4 auto(s4) 4 20 27 r compact

Example: Do cars with big engines use more fuel than cars with small engines?

  1. Start with mapping and aes()
  • displ. Engine displacement, in litres (bigger numbers = greater engine size)
  • hwy. Highway miles per gallon (bigger numbers = using less fuel each mile)
    • You are mapping aesthetics onto data
    • x-axis position maps onto displ
    • y-axis position maps onto hwy
    • x-position and y-position are aesthetics: things that we can perceive on the graphic
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

Example: Do cars with big engines use more fuel than cars with small engines?

  1. Add a layer, like geom_point()
  • Layers determine physical representations of data

Source. A layered grammar of graphics

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

Example: Which types of cars have big engines but decent highway mpg?

  1. Add a new aesthetic mapping, like color
    • class. “type” of car
    • Color maps onto class

Source. A layered grammar of graphics

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

Example: Which types of cars use more fuel as their engines get bigger?

  1. Add a faceting specification with facet_wrap(facets)
    • class. “type” of car
    • facets. facets = ~ “categorical variable name”
    • This “formula” splits your plot into facets, which are subsets of data based on categorical variables (preferably)

Source. Data visualization: facets

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(facets = ~ class)

Exercise: Play with the sleepstudy dataset that comes with lme4

  • sleepstudy. “The average reaction time per day for subjects in a sleep deprivation study. On day 0 the subjects had their normal amount of sleep. Starting that night they were restricted to 3 hours of sleep per night. The observations represent the average reaction time on a series of tests given each day to each subject.”

Source. help("sleepstudy")

sleepstudy %>% sample_n(size = 15)
Reaction Days Subject
254.4907 2 335
277.8989 2 371
265.4731 0 334
304.6336 5 351
430.5853 8 308
364.1236 9 372
362.9184 7 350
283.8424 0 333
266.5955 7 351
213.6303 6 309
285.1330 3 330
242.8118 1 332
382.2038 6 308
276.2012 1 334
336.2806 8 349

Exercise: Play with the sleepstudy dataset that comes with lme4

  • sleepstudy.
    • does average reaction time change over time?
    • does change over time depend on Subject?
    • hint. help("facet_wrap")

Source. help("sleepstudy")

See below for plot exercise solution

  • Look at this after you’ve tried to solve yourself
ggplot(data = sleepstudy, mapping = aes(x = Days, y = Reaction)) +
  geom_point() +
  facet_wrap(facets = ~ Subject)

Example: Which type of car has better mpg in the city, on average?

  • cty. City miles per gallon
    • Use new x- and y-axis aesthetics with aes()
    • Now x-position maps onto class
    • And y-position maps onto cty
ggplot(data = mpg, mapping = aes(x = class, y = cty))

Which type of car has better mpg in the city, on average?

  • Add a boxplot layer
    • Boxplots are a complex layer
    • e.g., They require statistics like fences, hinges, and outliers in addition to the shapes we see

Source. A layered grammar of graphics

ggplot(data = mpg, mapping = aes(x = class, y = cty)) +
  geom_boxplot()

Which type of car has better mpg in the city, on average?

  • Add a statistical summary layer
    • Normative bars representing means + 95% confidence intervals
    • stat_summary(): “Summarise y values at unique/binned x”

Sources. help("stat_summary")

ggplot(data = mpg, mapping = aes(x = class, y = cty)) +
  stat_summary(geom = "bar", fun.data = mean_cl_normal)

Which type of car has better mpg in the city, on average?

  • Add a statistical summary layer
    • Normative bars representing means + 95% confidence intervals
    • stat_summary(): “Summarise y values at unique/binned x”
    • mean_cl_normal(): “… computes 3 summary variables: the sample mean and lower and upper Gaussian confidence limits based on the t-distribution.”

Sources. help("stat_summary") and help("mean_cl_normal")

ggplot(data = mpg, mapping = aes(x = class, y = cty)) +
  stat_summary(geom = "bar", fun.data = mean_cl_normal) +
  stat_summary(geom = "errorbar", fun.data = mean_cl_normal, width = 0)

Which type of car has better mpg in the city, on average?

  • cty. City miles per gallon
  • Standard bars representing means + 95% confidence intervals
    • stat_summary(): “Summarise y values at unique/binned x”
    • mean_cl_normal(): “… computes 3 summary variables: the sample mean and lower and upper Gaussian confidence limits based on the t-distribution.”
  • Fill the bars with fill = white, and color their outline with color = “black”

Sources. help("stat_summary") and help("mean_cl_normal")

ggplot(data = mpg, mapping = aes(x = class, y = cty)) +
  stat_summary(geom = "bar", fun.data = mean_cl_normal, fill = "white", color = "black") +
  stat_summary(geom = "errorbar", fun.data = mean_cl_normal, width = 0)

Exercise: Play with the US_births_2000_2014 dataset that comes with the fivethirtyeight package

Source. help("US_births_2000_2014")

Exercise: Play with the US_births_2000_2014 dataset that comes with the fivethirtyeight package

US_births_2000_2014 %>% sample_n(size = 15)
year month date_of_month date day_of_week births
2008 11 7 2008-11-07 Fri 13203
2012 3 4 2012-03-04 Sun 7021
2013 4 1 2013-04-01 Mon 10440
2008 6 3 2008-06-03 Tues 13459
2001 5 26 2001-05-26 Sat 8911
2007 12 17 2007-12-17 Mon 13455
2007 6 3 2007-06-03 Sun 7813
2012 2 1 2012-02-01 Wed 11916
2002 11 11 2002-11-11 Mon 11509
2011 8 15 2011-08-15 Mon 12756
2014 3 9 2014-03-09 Sun 6789
2006 8 31 2006-08-31 Thurs 14391
2011 8 10 2011-08-10 Wed 13160
2003 1 12 2003-01-12 Sun 7297
2008 12 25 2008-12-25 Thurs 6471

Exercise: Play with the US_births_2000_2014 dataset that comes with the fivethirtyeight package

  • Do births change over the week?
  • Does the weekly pattern of births change over the years?

Source. help("US_births_2000_2014")

See below for plot exercise solution

  • Look at this after you’ve tried to solve yourself
ggplot(data = US_births_2000_2014, mapping = aes(x = day_of_week, y = births)) +
  geom_point() +
  facet_wrap(facets = ~ year)

More Exercises

  • Datasets
    • data()
    • data(package = "ggplot2")
    • data(package = "fivethirtyeight")
  • using datasets from above, complete these exercises:
    • Generate a boxplot with geom_boxplot()
    • Generate a scatterplot with geom_point()
    • Generate bars of means with 95% confidence intervals with stat_summary()
    • Generate a QQ-plot with geom_qq()
    • Generate a histogram with geom_histogram()
    • Facet your data with facet_wrap()
    • Use the help() function, the tidyverse website, and Google (e.g., “qq-plot in ggplot2”) to troubleshoot problems

Resources